Skip to content

Conversation

@DajanaV
Copy link
Collaborator

@DajanaV DajanaV commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16636

This PR adds for cache_a and cache_b to load an additional vec2, and increases BK=32 for non-CM mul_mm.comp

Performance Comparison (Without coopmat and coopmat2) NVIDIA GeForce RTX 4060 Ti
Kernel Before(us/run) After(us/run) Δ %
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5767.79 5176.01 +10.26%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5355.88 4105.95 +23.34%
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5219.90 5432.22 -4.07%
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2722.40 2732.62 -0.38%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2743.99 2753.02 -0.33%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2843.99 2850.78 -0.24%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2840.88 2841.73 -0.03%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 2853.15 2857.24 -0.14%
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4327.78 4334.87 -0.16%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4306.28 4289.52 +0.39%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4751.79 4781.23 -0.62%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4748.76 4785.89 -0.78%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5155.43 5164.14 -0.17%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4900.78 4914.74 -0.28%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4318.07 4371.76 -1.24%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4643.73 4815.24 -3.69%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5250.76 5015.61 +4.48%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4348.33 4388.21 -0.92%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4821.34 4570.77 +5.20%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5646.37 5633.01 +0.24%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4229.37 4240.83 -0.27%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4339.20 4358.97 -0.46%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 4724.33 4779.14 -1.16%
Performance Comparison (Without coopmat and coopmat2) AMD Radeon RX 7800 XT
Kernel Before(us/run) After(us/run) Δ %
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 8873.61 5853.29 +34.04%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 6458.76 5747.87 +11.01%
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7124.22 7401.83 -3.90%
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3289.51 3318.63 -0.89%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3499.61 3527.61 -0.80%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3424.27 3446.08 -0.64%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3707.70 3732.88 -0.68%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 3747.02 3767.69 -0.55%
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 6160.74 6393.07 -3.77%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5936.61 6047.77 -1.87%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7717.80 7037.06 +8.82%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 8219.73 8849.61 -7.66%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7289.05 7447.10 -2.17%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 7668.33 6923.90 +9.71%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5797.82 5618.78 +3.09%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5764.74 5403.05 +6.27%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5695.78 5998.68 -5.32%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 6074.55 5980.28 +1.55%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5571.36 5367.69 +3.66%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5704.28 5651.10 +0.93%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 6416.39 5307.34 +17.28%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 5968.62 5845.84 +2.06%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) 8289.75 7982.64 +3.70%

Signed-off-by: Stefan Savic <[email protected]>
@loci-agentic-ai
Copy link

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Based on the comprehensive analysis of llama.cpp versions 478f0f77-4a6a-4412-93cb-65be885f7ec3 vs a98c0b17-e20d-4b11-8978-6d6d10c53020, the changes represent Condition 1: No meaningful performance impact.

Overview

The analysis reveals minimal performance variations with no functional code modifications in core inference components. The highest measured change was a 0.17% throughput increase in a C++ standard library constructor, while critical LLM functions remain unchanged.

Key Findings

Performance Metrics:

  • Highest Response Time Change: std::vector<llm_bigram_spm>::pop_back() improved by 0.10% (0.067 ns reduction from 67 ns to 67 ns)
  • Highest Throughput Change: std::_Optional_base constructor degraded by 0.17% (0.040 ns increase from 24 ns to 24 ns)

Core Function Impact:
No changes detected in critical inference functions:

  • llama_decode() - unchanged
  • llama_encode() - unchanged
  • llama_tokenize() - unchanged
  • llama_model_load_from_file() - unchanged

Tokens Per Second Impact:
Zero impact on inference performance. The measured changes occur in auxiliary STL functions unrelated to the tokenization/inference pipeline. Core functions responsible for token processing remain identical between versions.

Power Consumption Analysis:
Negligible power consumption changes across all 15 binaries:

  • libllama.so: 280,665 nJ (effectively unchanged)
  • llama-cvector-generator: 314,116 nJ (effectively unchanged)
  • All GGML libraries maintain identical power profiles

Flame Graph & CFG Analysis:

  • Identical assembly code: Byte-for-byte identical instructions in analyzed functions
  • No structural changes: Control flow graphs show identical branching patterns
  • Performance variance: The 0.06 ns improvement reflects measurement noise rather than code optimization

GitHub Code Review (PR #78):
The actual code changes target Vulkan GPU compute shaders for matrix multiplication optimization, completely separate from the CPU-based functions showing performance variations. PR #78 introduces:

  • Enhanced F32/F16 matrix operations with up to 34% GPU performance improvements
  • No impact on CPU inference pipeline or tokenization components

Conclusion:
The measured performance differences represent normal measurement variance rather than functional improvements. Core LLM inference capabilities remain unchanged, with zero impact on production workloads.

@DajanaV DajanaV force-pushed the main branch 27 times, most recently from 44faeaa to d7421a0 Compare November 8, 2025 09:08
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 9d00b69 to c481809 Compare December 10, 2025 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants